Fix bug that some requests of express / axios are not close correctly#20
Fix bug that some requests of express / axios are not close correctly#20kezhenxu94 merged 9 commits intomasterfrom
Conversation
Ok, well this is interesting, according to the image you attach the bug is not exactly fixed - note the missing entry spans for the non-error More generally I am worried about HttpPlugin, why was it refactored? Specifically I am worried about the change to use So for HttpPlugin, since I did actually spend a good bit of time making sure the order of execution was good with potential higher level modules and the actual higher level express, I would say that if there was no concrete reason for the refactor of HttpPlugin then that should be reverted, especially because of the potential problematic use of |
Ok, missed that, so my only doubt remains the use of [NOTE]: I am not CERTAIN that |
This make sense to me now, this concern didn't show up until this bug appears.
I can't agree more with the opinion, removing Anyway, I agree to keep dependencies as simple as possible and hence I'll remove this for now |
|
Also, the refactor was to fix the problem that |
Any plugin for a module will be safe to [EDIT] Meaning you can safely use |
Hmm, I had not considered that, how bad is the time difference? [EDIT] And how important is it that the times match? |
By "don't match", I meant for example, the
This effects nearly all time-related metrics, slow services, latency, response time percentiles, etc. |
Just to double check that we are not excluding something unnecessarily I have been looking inside function createListener(msg) {
function listener(err) {
if (msg.__onFinished === listener) msg.__onFinished = null
And I see how the time difference problem is caused in the reverted HttpPlugin, just need to bind |
Yes that is indeed a problem that I didn't realize before, thanks for sharing.
Yep the tests result shows this problem exactly, and that's what I said without P.S. I'm also finding some hooks of import / require that may avoid the problem of the different instances, something like https://stackoverflow.com/questions/24602136/import-hooks-in-node-js , need some tryings though |
Well, the asynchronous nature of JS will make this an interesting problem, especially for async client / servers. Will need to check that time reported is actual time taken and not just to return a Promise which may be similar on both client and server but not be the actual time of processing. As for http I am sure there is a way to do it, I may just duplicate part of
I don't think the "problem" of different instances is something you want to "solve" globally. It is just how Node works and it is something to be aware of, but if you try to change it globally you may wind up breaking things you don't even know about. Also, does |
Exactly.
Hopefully this can be extracted as an individual reusable module / class. Also note not to copy the codes from on-finished because of licensing issues.
This is the last sort if it really needs. Just to search and see we have a fallback solution. |
Actually not even extract anything, I think its just adding a listener on request socket error or close and removing the
What problem are you actually trying to solve? Do you want to hook BTW: I don't know why the tests errored here, my test is fine with your latest push. |
Actually its as simple as removing request.on('close', stopIfNotStopped);
request.on('abort', () => (span.errored = true, stopIfNotStopped()));
request.on('error', (err) => (span.error(err), stopIfNotStopped()));It is not perfect yet since |
Removed, let's see what the test result looks like |
The client But yes, this is going to be a tricky issue in other plugins as well, hooking into the correct events and making sure of consistency. |
|
The differences here in times are 6 and 25 ms. (ms resolution right? or ns?) Possibly lag due to event loop - data may all be sent and client may receive it and mark as complete but close event on server may not tick until after that. Or maybe still needs some work where to mark end in server. Run my test with this and have a look, it all "looks" more or less right, but again more detailed checking needed. Changing the |
|
The tests failed because of |
Really? So did I misunderstand it? Because to me it appears it is the end times which were off? And yes if it really is |
|
Hmm, I'm rechecking the expected data. |
Looking at the code, the start times can not possibly be out of order, that would violate the laws of physics (causality, server span will not start unless the client span started and requested). |
My eyes blurred and mistook the two spans relationship, should be OK now, thank you Tom :) |
|
I have another idea of what this might be, server doesn't close span until it gets confirmation from client that connection closed, in which case client would have closed span BEFORE server causing server to be a few ms. late. |
Nope, running a simple test locally server span is stopped before client span. Can you confirm for me that I am chasing the correct problem, that server span end time is sometimes recorded a few ms. (I get around 20) after client span? |
As a reminder, please check the intercepting points, if you fetch the timestamp of responding the first byte at server side, and read the timestamp of last byte received, the duration could be a negative value, due to the transport of the whole body needs time. |
What I meant is that I am not sure I am interpreting the test failure correctly, that it is due to server span end time AFTER client span end time. |
|
I know. I didn't check the codes, but my point is, it could be, if the timestamp you chosen as start and end time. |
I think the |
|
@tom-pytel I think the tests all passed, and I run your tests locally, which seems to work well too, can you please take a look at the codes and verify the aforementioned tests if needed. Thanks BTW, Happy New Year 🎉 !! |
Thx, you too. Checked on my side with my tests and is ok. I have one issue though, why did you remove The purpose of If this mechanism is removed then if there is something that creates a span between when |
Yes that was what I would ask you to have a double check.
From what I've tested, if you remove itself (by segmentItems:
- serviceName: server
segmentSize: 1
segments:
- segmentId: 99239d3627cc45c891da83ad70cc2ed2
spans:
- operationName: /axios
operationId: 0
parentSpanId: -1
spanId: 0
spanLayer: Http
startTime: 1609509943404
endTime: 1609509943412
componentId: 49
isError: false
spanType: Entry
peer: server:5000
skipAnalysis: false
tags:
- {key: http.url, value: 'server:5000/axios'}
- {key: http.status.code, value: '200'}
refs:
- {parentEndpoint: '', networkAddress: 'server:5000', refType: CrossProcess,
parentSpanId: 2, parentTraceSegmentId: 2417bad1936c4ee88df68a619d534d0f, parentServiceInstance: ae93997659b8,
parentService: client, traceId: 551f3481f8fc4a84aa262455391e2f43}
- operationName: /json
operationId: 0
parentSpanId: 0
spanId: 1
spanLayer: Http
startTime: 1609509943407
endTime: 1609509944263
componentId: 4005
isError: false
spanType: Exit
peer: httpbin.org
skipAnalysis: false
tags:
- {key: http.url, value: httpbin.org/json}
- {key: http.status.code, value: '200'}
- {key: http.status.msg, value: OK}
- operationName: /json
operationId: 0
parentSpanId: -1
spanId: 2
spanLayer: Http
startTime: 1609509943415
endTime: 1609509944268
componentId: 2
isError: false
spanType: Exit
peer: httpbin.org:80
skipAnalysis: false
tags:
- {key: http.url, value: 'httpbin.org:80/json'}
- {key: http.status.code, value: '200'}
- {key: http.status.msg, value: OK}
- serviceName: client
segmentSize: 1
segments:
- segmentId: 2417bad1936c4ee88df68a619d534d0f
spans:
- operationName: /axios
operationId: 0
parentSpanId: -1
spanId: 0
spanLayer: Http
startTime: 1609509943388
endTime: 1609509943392
componentId: 49
isError: false
spanType: Entry
peer: localhost:5001
skipAnalysis: false
tags:
- {key: http.url, value: 'localhost:5001/axios'}
- {key: http.status.code, value: '200'}
- operationName: /axios
operationId: 0
parentSpanId: 0
spanId: 1
spanLayer: Http
startTime: 1609509943390
endTime: 1609509944272
componentId: 4005
isError: false
spanType: Exit
peer: server:5000
skipAnalysis: false
tags:
- {key: http.url, value: 'server:5000/axios'}
- {key: http.status.code, value: '200'}
- {key: http.status.msg, value: OK}
- operationName: /axios
operationId: 0
parentSpanId: -1
spanId: 2
spanLayer: Http
startTime: 1609509943395
endTime: 1609509944276
componentId: 2
isError: false
spanType: Exit
peer: server:5000
skipAnalysis: false
tags:
- {key: http.url, value: 'server:5000/axios'}
- {key: http.status.code, value: '200'}
- {key: http.status.msg, value: OK}Note the 2 exit spans for the same, one from
Seems the correct Unless there is another bug or the |
Hmm, this is what I meant when I said the interaction with the |
Sounds good to me |
|
So I think this is good to merge for now right? @tom-pytel |
|
Looks good to go then, will you do a 0.1.1 fix release? |
I'd like to have minor release, or if we could have the axios fixed within a reasonable time, I can also wait for that, WDYT? |
I have an idea on how to tackle the axios thing but working on something else at the moment so may be a couple of days. In the meantime if you want you could check just how much of a problem it could be in the real world (it is a problem in theory but it may not manifest), the simple way would be to do a couple of hundred concurrent requests and see if there is any parent / child anomaly. |
Thanks for the clue, I may take a look maybe several days later cause' I'm also doing something else. Thanks again for the long conversations! Merging. |
|
Ok, well just out of curiosity and since it was a simple tweak I checked this and unfortunately the problem is not theoretical but actually fairly bad if there is no async / resync. If you are curious to see what happens the await Promise.all(new Array(100).fill().map(() => downloadPageA('localhost', 8100, '/'))); |
I'll also have a try (maybe tomorrow) |


Closes apache/skywalking#6109
@tom-pytel this seems to be a problem caused by the removal of
topModule.require, I removed it because of the reason I stated in #13 (comment) , so again I added it back in this PR with a little refactor for both (env var andtopModule.require), please take a look